Data Derivation

The beginning dataset.

(df)

I first begin by preparing my original data from deliverable one to be joined with new data sources. My intent is to fill in much of the city, county, and state data for the injury (overdose), death, and residence locations. To prepare my data, I extracted the coordinates for latitude and longitude where applicable, to eventually use this as the joining factor with new data. I set these values to only two decimal places for ease of joining as the new data contains two decimal points for each. Additionally, I derived a new variable locationDistance which refers to the miles between the coordinates of death and injury (overdose).

df$latDeath <- df$DeathCityCoordinates %>%
  str_extract('\\d\\d\\.\\d+') %>%
  as.double()
df$longDeath <- df$DeathCityCoordinates %>%
  str_extract('\\-*\\d\\d\\.\\d+\\)') %>%
  str_extract('\\-*\\d\\d\\.\\d+') %>%
  as.double()

df$latInjury <- df$InjuryCityCoordinates %>%
  str_extract('\\d\\d\\.\\d+')%>%
  as.double()
df$longInjury <- df$InjuryCityCoordinates %>%
  str_extract('\\-*\\d\\d\\.\\d+\\)') %>%
  str_extract('\\-*\\d\\d\\.\\d+') %>%
  as.double()

df$latResidence <- df$ResidenceCityCoordinates %>%
  str_extract('\\d\\d\\.\\d+')%>%
  as.double()
df$longResidence <- df$ResidenceCityCoordinates %>%
  str_extract('\\-*\\d\\d\\.\\d+\\)') %>%
  str_extract('\\-*\\d\\d\\.\\d+') %>%
  as.double()

#calculates distance between overdose and death then converts meters to miles
df$locationDistance <- distHaversine(cbind(df$longDeath, df$latDeath), cbind(df$longInjury, df$latInjury))
df$locationDistance <- conv_unit(df$locationDistance, "m", "mi")


df$latDeath <- round(df$latDeath, digits=2)
df$longDeath <- round(df$longDeath, digits=2)
df$latInjury <- round(df$latInjury, digits=2)
df$longInjury <- round(df$longInjury, digits=2)
df$latResidence <- round(df$latResidence, digits=2)
df$longResidence <- round(df$longResidence, digits=2)

#format change for merging data
df$ResidenceCity <- gsub("FALLS VILLAGE", "FALLS", df$ResidenceCity)

Data Additions

I now begin incorporating my new data into my project. Through the use of web scraping, I obtain the latitude and longitude for cities in Connecticut and format the values to join appropriately with my data.

#formatted to reduce timeout errors
url = "https://www.mapsofworld.com/usa/states/connecticut/lat-long.html"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
web <- read_html("scrapedpage.html")

my_paragraphs = html_nodes(web, "tbody")
my_tr = html_nodes(my_paragraphs, "tr")

cityname <- my_tr %>%
  html_nodes("td:first_child") %>%
  html_text()
citylat <- my_tr %>%
  html_nodes("td:nth_child(2)") %>%
  html_text() %>%
  as.double()
citylong <- my_tr %>%
  html_nodes("td:nth_child(3)") %>%
  html_text() %>%
  as.double


citycoordinates <- cbind.data.frame(cityname=cityname, lat=citylat, long=citylong) 
citycoordinates <- citycoordinates[1:142,]

#format to match current data on merge
citycoordinates$cityname <- citycoordinates$cityname %>%
  toupper() 
citycoordinates$cityname <-  gsub(" CITY", "", citycoordinates$cityname) 
citycoordinates$cityname <- gsub(" BOROUGH", "", citycoordinates$cityname)
citycoordinates$cityname <- gsub("JEWETT", "JEWETT CITY", citycoordinates$cityname)
citycoordinates$cityname <- gsub(" VILLAGE", "", citycoordinates$cityname)
citycoordinates$cityname <- gsub(" \\(BALANCE\\)", "", citycoordinates$cityname)
citycoordinates$cityname <- gsub("MILFORD ", "MILFORD", citycoordinates$cityname)

My web scraped data does not contain information on the county of each city, so I incorporate a third dataset. My third dataset includes a list of all the cities in Connecticut and the associating county. I format this to join with my web scraped data on the matching city names to now have a resulting dataset that includes the latitude, longitude, city and state.

countyinfo <- "uscities.csv" %>%
  read.csv() %>%
  filter(state_name == "Connecticut") %>%
  select(c("city", "county_name", "state_name"))


#format column names and city names to help join datasets
countyinfo$city <- toupper(countyinfo$city)
countyinfo$county_name <- toupper(countyinfo$county_name)


countyinfo$state_name <- as.character.factor(countyinfo$state_name)
countyinfo$state_name <- "CT"

colnames(countyinfo)[colnames(countyinfo)=="city"]<-"cityname"
countyinfo$cityname <-  gsub(" VILLAGE", "", countyinfo$cityname)
countyinfo$cityname <-  gsub("GLENVILLE", "GREENWICH", countyinfo$cityname)
countyinfo$cityname <-  gsub("MILFORD CITY", "MILFORD", countyinfo$cityname) 
cit <- merge(x = countyinfo, y = citycoordinates, by = "cityname", all.x = TRUE)

#Add two missing row values
cit$lat <- ifelse(cit$cityname=="MILFORD ", 41.23,   cit$lat)
cit$long <- ifelse(cit$cityname=="MILFORD ", -73.06,   cit$long)
cit$lat <- ifelse(cit$cityname=="PLANTSVILLE", 41.58,   cit$lat)
cit$long <- ifelse(cit$cityname=="PLANTSVILLE", -72.89,   cit$long)

The following is the resulting dataset with each city’s latitude, longitude, and county.

(cit)

Now that the data has been formatted correctly, I join the new data with my original data (which I will reference as df from now on). As the values are not going to have exact matches between the df and new data, I have to join the datasets by determining the closest match of latitude and longitude values. After joining the data together, I substitute all blank values where applicable. I have printed the number of city, county, and state empty values for the death location before and after this join and substitution to show the results of this step.

setDT(df)
setDT(cit)
df[, `:=`(lat.join = latDeath, long.join = longDeath)]
cit[, `:=`(lat.join = lat, long.join = long)]
df <- cit[df, on = c("lat.join", "long.join"),roll = "nearest"]


sum(is.na(df$DeathCity))
## [1] 5
sum(is.na(df$DeathCounty))
## [1] 1101
df$DeathCity <- ifelse(is.na(df$DeathCity), df$cityname,  df$DeathCity)
df$DeathCounty <- ifelse(is.na(df$DeathCounty), df$county_name,  df$DeathCounty)
sum(is.na(df$DeathCity))
## [1] 0
sum(is.na(df$DeathCounty))
## [1] 322

I now repeat the previous steps for both Residence as well as Injury (overdose) locations.

df[, `:=`(lat.join = latResidence, long.join = longResidence)]
df <- cit[df, on = c("lat.join", "long.join"),roll = "nearest"]


df$ResidenceCity <- ifelse(is.na(df$ResidenceCity), df$cityname,  df$ResidenceCity)
df$ResidenceCounty <- ifelse((is.na(df$ResidenceCounty)), df$county_name,  df$ResidenceCounty)
df$ResidenceState <- ifelse((is.na(df$ResidenceState)), df$state_name,  df$ResidenceState)
df[, `:=`(lat.join = latInjury, long.join = longInjury)]
df <- cit[df, on = c("lat.join", "long.join"),roll = "nearest" ]


df$InjuryCity <- ifelse(is.na(df$InjuryCity), df$cityname,  df$InjuryCity)
df$InjuryCounty <- ifelse(is.na(df$InjuryCounty), df$county_name,  df$InjuryCounty)
df$InjuryState <- ifelse(is.na(df$InjuryState), df$state_name,  df$InjuryState)

The concluding dataset.

(df)

Models

Within this project, I became very interested in the location people were declared dead compared to where they overdosed. I believed there could be correlations and a relationship between these locations. While the relationship could be explained by countless different variables like the time and distance from emergency services the overdose occured, to those who died being alone and unable to call 911, or those who witnessed the overdose and did not contact 911, I use the data I have to see if there’s a strong enough relationship to make predictions of the distance between death and overdose. I begin by creating a model to predict the distance with age, sex, number of drugs, city of overdose (InjuryCity), and the type of location (residence, hospital, etc) in a multiple linear regression. There is a relationship, but not strong enough to confidently make predictions, with most variations of the summary resulting in a confidence around 50% of predicting the correct distance (indicated by the R2 value).

set.seed(385)
samp <- df %>%
  filter(!is.na(df$locationDistance), !is.na(df$InjuryCity), !is.na(df$Age), 
           !is.na(df$Sex), !is.na(df$numberOfDrugs),  !is.na(df$Location)) 

sample_selection <- samp$locationDistance %>%
  createDataPartition(p=0.75, list = FALSE)
train <- samp[sample_selection, ]
test <- samp[-sample_selection, ]
train_model <- lm(locationDistance ~ Age + Sex + numberOfDrugs + InjuryCity + Location, data=samp)


prediction <- train_model %>% predict(test)
head(prediction)
##           8          12          19          20          29          30 
## 23.83378965 22.25884941 25.16473351 -0.05517411  2.40833553 25.13576336
head(samp$locationDistance)
## [1] 47.19019 59.83523 12.68688 12.92604 20.82620 14.29051
R2(prediction, test$locationDistance)
## [1] 0.6156041
ggplot(test, aes(x=prediction, y=locationDistance))+geom_point(color = "blue") +
  labs(title='Distance Between Death and Overdose vs Predicted (v1)', x='Distance From Death Location (miles)', y='Predicted Distance from Death Location (miles)')

The resulting summary from the model above (redacted due to the large number of factors within the InjuryCity variable) showed the most influential variables with a strong relationship to the distance (as indicated by the lowest p values) to be the city of overdose and the type of location. Due to this, I now updated my model to include only these variables. While the confidence in prediction has increased, it has not increased enough to consider the model strong enough to make confident predictions. We now can expect to accurately derive the location distance around 63% of the time.

samp <- df %>%
  filter(!is.na(df$locationDistance), !is.na(df$InjuryCity), !is.na(df$Location)) 

sample_selection <- samp$locationDistance %>%
  createDataPartition(p=0.90, list = FALSE)
train <- samp[sample_selection, ]
test <- samp[-sample_selection, ]
train_model <- lm(locationDistance ~ InjuryCity + Location, data=samp)


prediction <- train_model %>% predict(test)
head(prediction)
##          11          32          40          42          70          71 
## -0.62951882  0.07511893  2.36858841 22.45277821  0.18891288 22.45277821
head(samp$locationDistance)
## [1] 47.19019 59.83523 12.68688 12.92604 20.82620 14.29051
R2(prediction, test$locationDistance)
## [1] 0.6324546
  ggplot(test, aes(x=prediction, y=locationDistance))+geom_point(color = "blue")  +
  labs(title='Distance Between Death and Overdose vs Predicted (v2)', x='Distance From Death Location (miles)', y='Predicted Distance from Death Location (miles)')

New Visualizations

We can see with this first visualization the majority of deaths are occuring within residences, and the likelihood of being declared dead at the scene of the overdose is likely. This observation could be indicative of the potential problem of emergency response access and societal implications that I will further look into in deliverable three.

filter(df, !is.na(locationDistance)) %>%
  filter(!is.na(Location)) %>%
  ggplot(aes(x=locationDistance))+ geom_line(aes(y=..count.., color=Location), stat="bin", binwidth=50)+
  labs(title='Location of Death vs Overdose', x='Distance between Overdose and Death (miles)', y='Number of Deaths')

filter(df, !is.na(Location), !is.na(locationDistance), !is.na(InjuryCounty), InjuryCounty!="WASHINGTON", InjuryCounty!="WESTCHESTER") %>%
  ggplot(aes(x=InjuryCounty, y=locationDistance)) + geom_point(aes(color=Location)) +
  labs(title="Distance Between Overdose and Death by County", x="Overdose County", y="Distance from Death Location (miles)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

filter(df, !is.na(Year)) %>%
  ggplot(aes(x=Year, y=..count..))+
  geom_bar(aes(fill=Fentanyl)) + 
  labs(title='Overdoses Involving Fentanyl over Time', x='Year', y='Number of Deaths') +
  geom_text(stat='count', aes(label=..count..), position=position_dodge(width=0.9), vjust=-0.25)

ggplot(df, aes(x=Fentanyl, y=..count..))+
  geom_bar(aes(fill=Location)) + 
  labs(title='Overdoses Involving Fentanyl by Location Type', x='Fentanyl Involved', y='Number of Deaths') +
  geom_text(stat='count', aes(label=..count..), position=position_dodge(width=0.9), vjust=-0.25)

Updated Visualizations

The two following visualizations show the added value of incorporating the two new datasets through an increase in values displayed, and less empty values from the same visualizations in deliverable one. With the new added data, I can make more confident predictions and correlations with the locations as a factor.

filter(df, !is.na(Year)) %>%
  ggplot(aes(x=Year, y=..count..))+
  geom_bar(aes(fill=factor(DeathCounty))) + 
  labs(title='Number of Accidental Drug Related Deaths 2012-2018 in Connecticut by Year', x='Year', y='Number of Deaths') +
  geom_text(stat='count', aes(label=..count..), position=position_dodge(width=0.9), vjust=-0.25)

filter(df, !is.na(DeathCounty)) %>% 
  ggplot(aes(x=DeathCounty)) +
  geom_bar(stat="count") + 
  labs(title='Number of Accidental Drug Related Deaths 2012-2018 in Connecticut by County', x='County', y='Number of Deaths') + 
  theme(axis.text.x=element_text(angle=90, hjust=1)) +
  geom_text(stat='count', aes(label=..count..), position=position_dodge(width=0.9), vjust=-0.25)

Conclusion

Overall, the addition of new data proved to be beneficial for creating more models, however, the models were not the strongest and will be worked on within the next deliverable. There is a major importance within the model I am attempting to address, as it could indicate problems within the state of Connecticut, whether it be emergency response times, lack of access to emergency responders, and more. I plan to attempt bringing in one last dataset to provide a new variable depicting the distance between overdose and emergency services.